Assignment 4: Spatial Predictive Analysis

311 Service Requests: Street Lights - One Out

Author

Tessa Vu

Published

November 17, 2025

GOAL

Apply spatial predictive modeling techniques using the “Street Lights - One Out” 311 service request type as the key predictor variable, alongside other built-environment variables, while avoiding any demographic predictors.

The above 311 request type was selected over the others because this lab focuses on building the most “ethical” predictive model possible. A single street light outage was judged the most ethical choice because it is the most spatially balanced of the candidate request types; reports of all street lights being out, or of abandoned cars, are likely to skew toward underserved and disinvested areas.

DATASETS

1. Chicago Boundaries and Built Environment

City Limits Boundaries

Neighborhood Boundaries

Police District Boundaries

Major Roads

2. 311 Service Requests (2017)

2017 Chicago 311 Service Requests “Street Lights - One Out” Dataset

3. Crimes (2017)

2017 crimes filtered for forced entry burglaries

4. Vacant Buildings (2017)

2017 Vacant Buildings Dataset

5. Land-Use / Zoning

Land-Use / Zoning Dataset, filtered to the business/commercial, downtown, residential, industrial, and parks/open space classes, and to records with creation dates through the end of 2017.

6. CTA L Stations

CTA L Stations Dataset

7. Buildings (2017)

Buildings Dataset

8. Crimes for Temporal Validation (2018)

2018 crimes filtered for forced entry burglaries to match 2017 data

ANALYSIS

PART I: Data Loading & Exploration

1. Setup

# Load relevant packages.
library(tidyverse)      # Data manipulation.
library(dplyr)          # Data manipulation.
library(stringr)        # Data manipulation.
library(sf)             # Spatial operations.
library(here)           # Relative file paths.
library(viridis)        # Color scales.
library(RColorBrewer)    # Color palettes.
library(terra)          # Raster operations (replaces "raster").
library(spdep)          # Spatial dependence.
library(FNN)            # Fast nearest neighbors.
library(MASS)           # Negative binomial regression.
library(patchwork)      # Plot composition (replaces grid/gridExtra).
library(knitr)          # Tables.
library(kableExtra)     # Table formatting.
library(classInt)       # Classification intervals.
library(ggplot2)        # Plotting.
library(car)            # Use VIF.
library(httr)           # API querying.

# Spatstat split into sub-packages.
library(spatstat.geom)    # Spatial geometries.
library(spatstat.explore) # Spatial exploration/KDE.

# Set options.
options(scipen = 999)  # No scientific notation.
set.seed(5080)         # Reproducibility.

# Create consistent themes for visualizations.
theme_chi_map <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold",
                                size = base_size + 1,
                                hjust = 0.5
                                ),
      plot.subtitle = element_text(face = "italic",
                                   color = "gray30",
                                   size = base_size - 1,
                                   hjust = 0.5
                                   ),
      legend.position = "bottom",
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank(),
      legend.title = element_text(face = "italic",
                                  size = base_size - 1,
                                  hjust = 0.5
                                  ),
      legend.title.position = "top",
      legend.text = element_text(face = "italic",
                                 size = base_size - 3,
                                 hjust = 0.5
                                 ),
      legend.key.width = unit(2, "cm"),
      legend.key.height = unit(0.5, "cm")
    )
}

2. Load Chicago Spatial Data

# Load Chicago's boundaries.
chicago_boundary <-
  st_read("data/chicago_boundary.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

# Load Chicago's police districts.
police_districts <-
  st_read("data/police_districts.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

# Load Chicago's neighborhood boundaries.
neighborhoods <-
  st_read("data/neighborhoods.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

# Load Chicago's major streets.
streets <-
  st_read("data/major_streets/major_streets.shp", quiet = TRUE) %>%
  st_transform("ESRI:102271")

# Load building data.
buildings <- st_read("data/buildings_2017.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

3. Load Burglary Data

# Load 2017 burglary data.
burglaries <-
  st_read("data/burglaries_2017.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

# Load 2018 burglary data.
burglaries_2018 <-
  st_read("data/burglaries_2018.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

4. Load 311 Data

# Load 2017 311 data for single street light outage.
light_outage <-
  st_read("data/single_light_outage_2017.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

5. Load Ancillary / Contextual Data

# Load vacant buildings data.
vacant <-
  st_read("data/vacant_buildings.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

# Load CTA L station data.
station <-
  st_read("data/L_stations.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

# Load zoning data.
zoning <-
  st_read("data/zoning.geojson", quiet = TRUE) %>%
  st_transform("ESRI:102271")

i. Clean Zoning Data

# Clean zoning data.
# Put different zoning sub-classes into one overall class.
zoning <- zoning %>%
  mutate(type = case_when(
    # Downtown
    str_starts(zone_class, "D") ~ "Downtown",
    
    # Residential
    str_starts(zone_class, "R")  ~ "Residential",
    
    # Business/Commercial
    str_starts(zone_class, "B|C") ~ "Business/Commercial",
    
    # Industrial
    str_starts(zone_class, "M") ~ "Industrial",
    
    # Park/Open Space
    str_starts(zone_class, "POS") ~ "Park/Open Space"
  )
)

ii. Clean Major Streets Data

# Clean major streets data.
# Merge all streets into one vector.
streets <- st_union(streets)

iii. Clean Building Data

# Clean building data.
buildings <- buildings %>%
  mutate(
    # Make sure the years are numeric.
    year_built = as.numeric(year_built),
    
    # Calculate Age for 2017.
    building_age = 2017 - year_built
  )

Note that this lab also uses abandoned/vacant buildings as an additional contextual predictor. A clear distinction must be made, however: the dataset used is not the 311 log of unconfirmed vacant-building complaints, which may be driven by emotional biases, but the record of confirmed, issued vacant-building violations. Other ancillary data include zoning, major roads, CTA L stations, and building footprints, from which spatial features are derived. Zoning types encourage different behaviors (residential vs. business/commercial, for example); major roads and L stations concentrate urban flow and give those engaging in criminal behavior access to different areas; building density measures how urban an area is, with higher density meaning more potential targets; and building age, with its U-shaped relationship to risk, can attract burglars as property values shift.
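The U-shaped age effect mentioned above could be captured with a quadratic term once the fishnet features are built later in the lab. This is a hypothetical sketch, not part of the lab's pipeline as written, and `building_age_sq` is an invented name:

```r
# Hypothetical: add a squared age term so risk can fall for newer
# buildings, bottom out, then rise again for very old stock.
fishnet <- fishnet %>%
  mutate(building_age_sq = avg_building_age^2)

# It could then enter a count model alongside the linear term, e.g.:
# MASS::glm.nb(count_burglaries ~ avg_building_age + building_age_sq,
#              data = st_drop_geometry(fishnet))
```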

6. Visualize Data

Chicago Neighborhoods (From Wikipedia)

# Burglaries.
# Point density map.
burglary_point_density <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_sf(data = burglaries,
          color = "#ff4100",
          size = 0.1,
          alpha = 0.5
          ) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Point Density | n = ", prettyNum(nrow(burglaries), big.mark = ","))
    ) +
  theme_chi_map()

# Kernel density map.
burglary_kernel_density <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.6,
    bins = 8
    ) +
  scale_fill_viridis_d(
    option = "rocket",
    direction = -1,
    guide = "none"
    ) +
  labs(
    title = "Surface Density",
    subtitle = "Kernel Density | bins = 8"
    ) +
  theme_chi_map()

# Hexbin map.
burglary_hexbin <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_hex(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    binwidth = c(1000, 1000), # Set proportionally equal hexagons.
    alpha = 0.6
    ) +
  scale_fill_viridis_c(
    option = "rocket",
    direction = -1,
    guide = "none"
    ) +
  labs(
    title = "Hexbin Density",
    subtitle = "Hexagonal Binning | binwidth = 1 km"
    ) +
  theme_chi_map()

# Combine plots using patchwork.
burglary_point_density + burglary_hexbin + burglary_kernel_density +
  plot_annotation(
    title = "Spatial Distribution of Burglaries",
    subtitle = "Neighborhoods in Chicago, Illinois (2017)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    )

# Street lights.
# Point density map.
light_point_density <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_sf(data = light_outage,
          color = "#ff4100",
          size = 0.1,
          alpha = 0.5
          ) +
  labs(
    title = "Single Street Light Outage Locations",
    subtitle = paste0("Point Density | n = ", prettyNum(nrow(light_outage), big.mark = ","))
    ) +
  theme_chi_map()

# Kernel density map.
light_kernel_density <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(light_outage)),
    aes(X, Y),
    alpha = 0.6,
    bins = 8
    ) +
  scale_fill_viridis_d(
    option = "rocket",
    direction = -1,
    guide = "none"
    ) +
  labs(
    title = "Surface Density",
    subtitle = "Kernel Density | bins = 8"
    ) +
  theme_chi_map()

# Hexbin map.
light_hexbin <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_hex(
    data = data.frame(st_coordinates(light_outage)),
    aes(X, Y),
    binwidth = c(1000, 1000), # Set proportionally equal hexagons.
    alpha = 0.6
    ) +
  scale_fill_viridis_c(
    option = "rocket",
    direction = -1,
    guide = "none"
    ) +
  labs(
    title = "Hexbin Density",
    subtitle = "Hexagonal Binning | binwidth = 1 km"
    ) +
  theme_chi_map()

# Combine plots using patchwork.
light_point_density + light_hexbin + light_kernel_density +
  plot_annotation(
    title = "Spatial Distribution of 311 Requests: Single Street Light Outage",
    subtitle = "Neighborhoods in Chicago, Illinois (2017)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    )

# Vacancies.
# Point density map.
vacant_point_density <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_sf(data = vacant,
          color = "#ff4100",
          size = 0.1,
          alpha = 0.5
          ) +
  labs(
    title = "Vacant Building Locations",
    subtitle = paste0("Point Density | n = ", prettyNum(nrow(vacant), big.mark = ","))
    ) +
  theme_chi_map()

# Kernel density map.
vacant_kernel_density <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(vacant)),
    aes(X, Y),
    alpha = 0.6,
    bins = 8
    ) +
  scale_fill_viridis_d(
    option = "rocket",
    direction = -1,
    guide = "none"
    ) +
  labs(
    title = "Surface Density",
    subtitle = "Kernel Density | bins = 8"
    ) +
  theme_chi_map()

# Hexbin map.
vacant_hexbin <- ggplot() + 
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = "gray30"
          ) +
  geom_hex(
    data = data.frame(st_coordinates(vacant)),
    aes(X, Y),
    binwidth = c(1000, 1000), # Set proportionally equal hexagons.
    alpha = 0.6
    ) +
  scale_fill_viridis_c(
    option = "rocket",
    direction = -1,
    guide = "none"
    ) +
  labs(
    title = "Hexbin Density",
    subtitle = "Hexagonal Binning | binwidth = 1 km"
    ) +
  theme_chi_map()

# Combine plots using patchwork.
vacant_point_density + vacant_hexbin + vacant_kernel_density +
  plot_annotation(
    title = "Spatial Distribution of Vacant Buildings",
    subtitle = "Neighborhoods in Chicago, Illinois (2017)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    )

# Major streets with neighborhoods.
neighborhood_streets_map <- ggplot() +
  geom_sf(data = neighborhoods,
          fill = "gray95",
          color = NA
          ) +
  geom_sf(data = streets,
          color = "#F03E36",
          linewidth = 0.4
          ) +
  geom_sf(data = neighborhoods,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  labs(title = "With Neighborhoods") +
  theme_chi_map()

# Major streets with police districts.
district_streets_map <- ggplot() +
  geom_sf(data = police_districts,
          fill = "gray95",
          color = NA
          ) +
  geom_sf(data = streets,
          color = "#F03E36",
          linewidth = 0.4
          ) +
  geom_sf(data = police_districts,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  labs(title = "With Police Districts") +
  theme_chi_map()

# Combine plots using patchwork.
neighborhood_streets_map + district_streets_map +
  plot_annotation(
    title = "Major Streets Network",
    subtitle = "Chicago, Illinois",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    )

7. Section Analysis

Part I covers the initial setup for the entire lab as well as the data-cleaning process: categorizing the land-use classes, creating new variables like building age, and unifying the street vector data. That last step was overlooked at first, until it came time to spatially engineer the distances to major roads; it is computationally lighter and more efficient to compute distances to a single unified geometry than to thousands of line segments across Chicago.
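The efficiency point can be sketched as follows; `street_segments` stands in for the pre-union layer (a hypothetical name, since the code above overwrites `streets` in place):

```r
# Before union: distance from every cell to every segment yields a huge
# cells-by-segments matrix that still needs a row-wise minimum.
# dist_all <- st_distance(st_centroid(fishnet), street_segments)
# nearest  <- apply(dist_all, 1, min)

# After st_union(): a single geometry, so st_distance() returns the
# nearest distance directly as a one-column matrix.
streets_union <- st_union(street_segments)
nearest <- as.numeric(st_distance(st_centroid(fishnet), streets_union))
```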

As for zoning, land use often has many sub-classes that vary by city; Chicago has over fifty, which were collapsed into five broader categories: downtown, residential, industrial, parks/open space, and business/commercial.

The EDA portion of this section is especially important, providing an initial understanding of the spatial distribution of each variable.

Burglaries (n = 7,509) were densest in Chicago’s west-central (the West Side), central-south (the Southwest Side), and southeastern (the South Side) areas. The neighborhoods at the darkest cores of the maps are West Town, Chicago Lawn, West Englewood, Woodlawn, Greater Grand Crossing, and South Shore.

The spatial distribution of single street light outages (n = 75,084) supports the assumption that this specific 311 complaint is the most “ethical” choice, as its densities span a much wider portion of Chicago.
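That balance claim could be spot-checked numerically with the sketch below, which counts requests per neighborhood and computes a coefficient of variation (lower means more evenly spread). The neighborhood-name column `pri_neigh` is an assumption about the loaded layer:

```r
# Count single-light outage requests per neighborhood.
outages_by_hood <- st_join(light_outage, neighborhoods) %>%
  st_drop_geometry() %>%
  count(pri_neigh) # Assumed column name for the neighborhood label.

# Coefficient of variation of per-neighborhood counts.
sd(outages_by_hood$n) / mean(outages_by_hood$n)
```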

Vacant buildings (n = 5,012) are sparser, with dense clusters in the West Side’s Humboldt Park and the Southwest Side’s New City, West Englewood, and Chicago Lawn.

Major streets are mapped side by side with the neighborhood and police district boundaries overlaid. The street network clearly aligns with CTA’s transit lines and marks the urban core within the Central area’s Loop neighborhood boundary.

PART II: Fishnet Grid Creation

1. Fishnet Grid Cells

# Burglaries.
# Create 500m x 500m grid.
fishnet <- st_make_grid(
  neighborhoods,
  cellsize = 500,
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago.
fishnet <- fishnet[neighborhoods, ]

2. Aggregate Datasets

# Spatial join.
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(count_burglaries = n())

# Join back to fishnet (cells with 0 requests will be NA).
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(count_burglaries = replace_na(count_burglaries, 0))

# Spatial join.
burglaries_2018_fishnet <- st_join(burglaries_2018, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(count_burglaries_2018 = n())

# Join back to fishnet (cells with 0 requests will be NA).
fishnet <- fishnet %>%
  left_join(burglaries_2018_fishnet, by = "uniqueID") %>%
  mutate(count_burglaries_2018 = replace_na(count_burglaries_2018, 0))

# Aggregate street light outages.
# Spatial join.
light_fishnet <- st_join(light_outage, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(count_outages = n())

# Join back to fishnet (cells with 0 requests will be NA).
fishnet <- fishnet %>%
  left_join(light_fishnet, by = "uniqueID") %>%
  mutate(count_outages = replace_na(count_outages, 0))

# Aggregate vacant buildings.
# Spatial join.
vacant_fishnet <- st_join(vacant, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(count_vacant = n())

# Join back to fishnet (cells with 0 requests will be NA).
fishnet <- fishnet %>%
  left_join(vacant_fishnet, by = "uniqueID") %>%
  mutate(count_vacant = replace_na(count_vacant, 0))

# Aggregate CTA L stations.
# Spatial join.
station_fishnet <- st_join(station, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(count_station = n())

# Join back to fishnet (cells with 0 requests will be NA).
fishnet <- fishnet %>%
  left_join(station_fishnet, by = "uniqueID") %>%
  mutate(count_station = replace_na(count_station, 0))

# Calculate zoning percentages.
# Get the area of each fishnet cell.
fishnet_zoning_area <- fishnet %>%
  mutate(cell_area = st_area(.)) %>%
  st_drop_geometry() %>%
  dplyr::select(uniqueID, cell_area) # MASS masks dplyr::select, so namespace it.

# Intersect zoning and fishnet, calculate area, and group by the zoning "type".
zoning_intersect <- st_intersection(fishnet, zoning) %>%
  mutate(intersect_area = st_area(.)) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, type) %>%
  summarize(total_intersect_area = sum(intersect_area), .groups = "drop") %>%
  left_join(fishnet_zoning_area, by = "uniqueID") %>%
  mutate(percent_area_zoning = as.numeric(total_intersect_area / cell_area))

# Pivot to get one row per grid cell and one column per "type".
zoning_features <- zoning_intersect %>%
  pivot_wider(
    id_cols = uniqueID,
    names_from = type,
    values_from = percent_area_zoning,
    values_fill = 0
  )

# Rename features because the slashes will error the regression.
zoning_features <- zoning_features %>%
  rename(
    pct_Residential = Residential,
    pct_Business_Commercial = `Business/Commercial`,
    pct_Park_Open_Space = `Park/Open Space`,
    pct_Industrial = Industrial,
    pct_Downtown = Downtown
  )

# Join back to fishnet and replace all NAs with 0.
fishnet <- fishnet %>%
  left_join(zoning_features, by = "uniqueID") %>%
  mutate(across(starts_with("pct_"), ~replace_na(., 0)))

# Calculate distance from cell to streets.
dist_matrix <- st_distance(fishnet, streets)

# Join to fishnet.
fishnet <- fishnet %>%
  mutate(
    dist_to_streets = as.numeric(dist_matrix[, 1])
  )

# Calculate building percentages.
# Get the area of each fishnet cell.
fishnet_building_area <- fishnet %>%
  mutate(cell_area = as.numeric(st_area(.))) %>%
  st_drop_geometry() %>%
  dplyr::select(uniqueID, cell_area) # MASS masks dplyr::select, so namespace it.

# Intersect buildings and fishnet and calculate area.
building_features <- st_intersection(fishnet, buildings) %>%
  mutate(intersect_area = as.numeric(st_area(.))) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(
    avg_building_age = mean(building_age),
    total_building_area = sum(intersect_area),
    avg_building_area = mean(intersect_area),
    .groups = "drop"
    )

# Join back to fishnet and replace all NAs with 0.
fishnet <- fishnet %>%
  left_join(building_features, by = "uniqueID") %>%
  left_join(fishnet_building_area, by = "uniqueID") %>%
  mutate(
    building_density = total_building_area / cell_area,
    avg_building_age = replace_na(avg_building_age, 0),
    building_density = replace_na(building_density, 0),
    avg_building_area = replace_na(avg_building_area, 0)
  )

3. Visualize Spatial Distributions

# Burglaries.
# Visualize aggregated counts.
burglary_fishnet_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = count_burglaries),
          color = NA
          ) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "rocket",
    direction = -1,
    trans = "sqrt",
    breaks = c(0, 1, 2, 5, 10, 20, 35)
    ) +
  labs(
    title = "Burglary Counts"
    ) +
  theme_chi_map()

# Street lights.
# Visualize aggregated counts.
light_fishnet_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = count_outages),
          color = NA
          ) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  scale_fill_viridis_c(
    name = "Outages",
    option = "rocket",
    direction = -1,
    trans = "sqrt",
    breaks = c(0, 5, 20, 50, 100, 160, 255)
    ) +
  labs(
    title = "Single Street Light Outages"
    ) +
  theme_chi_map()

# Combine plots using patchwork.
burglary_fishnet_map + light_fishnet_map +
  plot_annotation(
    title = "Spatial Distribution of Burglaries and 311 Requests",
    subtitle = "500m x 500m Fishnet Grid Cells in Chicago, Illinois (2017)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    )

# Vacancies.
# Visualize aggregated counts.
vacant_fishnet_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = count_vacant),
          color = NA
          ) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  scale_fill_viridis_c(
    name = "Vacancies",
    option = "rocket",
    direction = -1,
    trans = "sqrt",
    breaks = c(0, 1, 5, 10, 25, 50, 83)
    ) +
  labs(
    title = "Vacancy Counts"
    ) +
  theme_chi_map()

# Stations.
# Visualize aggregated counts.
station_fishnet_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = count_station),
          color = NA
          ) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  scale_fill_viridis_c(
    name = "Stations",
    option = "rocket",
    direction = -1,
    trans = "sqrt"
    ) +
  labs(
    title = "CTA L Station Counts"
    ) +
  theme_chi_map()

# Combine plots using patchwork.
vacant_fishnet_map + station_fishnet_map +
  plot_annotation(
    title = "Spatial Distribution of Vacant Buildings and CTA L Stations",
    subtitle = "500m x 500m Fishnet Grid Cells in Chicago, Illinois (2017)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    )

# Buildings.
# Visualize average age.
building_age_fishnet_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = avg_building_age),
          color = NA
          ) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  scale_fill_viridis_c(
    name = "Average Building Age (Years)",
    option = "rocket",
    direction = -1,
    trans = "sqrt",
    breaks = c(0, 25, 50, 75, 100, 125),
    labels = function(x) paste0(x, " yrs")
    ) +
  labs(
    title = "Average Building Age"
    ) +
  theme_chi_map()

# Visualize density.
building_density_fishnet_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = building_density),
          color = NA
          ) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  scale_fill_viridis_c(
    name = "Building Density (Percent %)",
    option = "rocket",
    direction = -1,
    trans = "sqrt",
    breaks = c(0, 0.05, 0.10, 0.20, 0.30, 0.50),
    labels = scales::percent_format(accuracy = 1) # Namespaced from the scales package.
    ) +
  labs(
    title = "Building Density"
    ) +
  theme_chi_map()

# Combine plots using patchwork.
building_age_fishnet_map + building_density_fishnet_map +
  plot_annotation(
    title = "Spatial Distribution of Building Age and Density",
    subtitle = "500m x 500m Fishnet Grid Cells in Chicago, Illinois (2017)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    )

# Create individual maps for each zoning type.
residential_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = pct_Residential),
          color = NA) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75) +
  scale_fill_viridis_c(
    name = "Percent (%)",
    option = "rocket",
    direction = -1,
    labels = scales::percent,
    limits = c(0, 1)
  ) +
  labs(title = "Residential") +
  theme_chi_map()

biz_com_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = pct_Business_Commercial),
          color = NA) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75) +
  scale_fill_viridis_c(
    name = "Percent (%)",
    option = "rocket",
    direction = -1,
    labels = scales::percent,
    limits = c(0, 1)
  ) +
  labs(title = "Business/Commercial") +
  theme_chi_map()

park_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = pct_Park_Open_Space),
          color = NA) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75) +
  scale_fill_viridis_c(
    name = "Percent (%)",
    option = "rocket",
    direction = -1,
    labels = scales::percent,
    limits = c(0, 1)
  ) +
  labs(title = "Park/Open Space") +
  theme_chi_map()

industrial_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = pct_Industrial),
          color = NA) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75) +
  scale_fill_viridis_c(
    name = "Percent (%)",
    option = "rocket",
    direction = -1,
    labels = scales::percent,
    limits = c(0, 1)
  ) +
  labs(title = "Industrial") +
  theme_chi_map()

downtown_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = pct_Downtown),
          color = NA) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75) +
  scale_fill_viridis_c(
    name = "Percent (%)",
    option = "rocket",
    direction = -1,
    labels = scales::percent,
    limits = c(0, 1)
  ) +
  labs(title = "Downtown") +
  theme_chi_map()

# Combine with patchwork.
combined_map <- (residential_map + biz_com_map + park_map) /
                (industrial_map + downtown_map) +
  plot_annotation(
    title = "Zoning Spatial Distribution by Type",
    subtitle = "500m x 500m Fishnet Grid Cells in Chicago, Illinois (2017)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    ) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")
combined_map

4. Kernel Density Estimation (KDE) Baseline

# Convert burglaries to ppp (point pattern) format for spatstat.
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicago_boundary))
)

# Calculate KDE with 1km bandwidth.
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth.
  edge = TRUE    # Edge correction.
)

# Convert to terra raster (modern approach, not raster::raster).
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells.
fishnet <- fishnet %>%
  mutate(
    kde_value_burglaries = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column.
  )

# Burglaries.
burglary_kde_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = kde_value_burglaries),
          color = NA
          ) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "#21918c",
          linewidth = 0.75
          ) +
  scale_fill_viridis_c(
    name = "KDE Value (Burglaries per m²)",
    option = "rocket",
    direction = -1,
    breaks = c(0, 0.00001, 0.00002, 0.00003, 0.000035),
    labels = scales::label_number(accuracy = 0.000001) # Namespaced from the scales package.
    ) +
  labs(
    title = "Kernel Density Estimation (KDE): Burglary Locations",
    subtitle = "Simple Spatial Smoothing | Chicago, Illinois (2017)"
    ) +
  theme_chi_map()

burglary_kde_map

5. Section Analysis

Part II builds the 500m x 500m fishnet grid to get a uniform view of Chicago, as opposed to block groups or census tracts, which vary in area and have very different urban and population characteristics.

This makes raw counts like burglaries comparable to one another, instead of being compared across wildly different areas (e.g., 5 burglaries in the state of Utah vs. 5 in a single city block, though rarely so extreme), a concern tied to the well-known Modifiable Areal Unit Problem (MAUP).

Burglary and light outage counts were aggregated in this section, along with the distance from each cell to the nearest major street line, now a single unified geometry from the previous section, which makes that computation straightforward.

The zoning portion was significantly more involved, both conceptually and technically: the process calculated how much of each zoning type intersected each cell. This was judged better than simple counting, because treating a minuscule industrial corner as 1 unit for a cell is less conceptually sound than a percentage based on polygon overlap (e.g., 50,000 m² of industrial overlap in a 250,000 m² cell yields pct_Industrial = 0.2).

The fishnet spatial distributions for burglaries, outages, and vacancies largely match the previous section’s maps in overall density; the fishnet maps are simply coarser due to the chosen cell resolution.
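The resolution is set entirely by `cellsize` in `st_make_grid()`; a finer grid is a one-line change, though it trades smoother-looking maps for sparser, noisier counts per cell. A hypothetical 250 m variant:

```r
# Hypothetical finer fishnet: 250 m cells roughly quadruple the cell count.
fishnet_fine <- st_make_grid(neighborhoods, cellsize = 250, square = TRUE) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago, as before.
fishnet_fine <- fishnet_fine[neighborhoods, ]
```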

CTA’s L stations naturally render as single dots given the point nature of the dataset, but they do overlap spatially with the major streets.

Looking at the building age, it’s clear that Chicago is a very old city. The central Loop shows some spots of newer buildings, likely because it is the urban core, where business hubs create more demand for new infill construction.

Then there is a subtle but noticeable dark ring of buildings generally over 100 years old surrounding the central Loop, and farther out toward the periphery are rings of buildings roughly 40 to 75 years old. Building density, drawn from the same dataset, spatially follows the same pattern, though density appears to diminish farther west and south of the center.

The very light spot in the south is due to Big Marsh Park, which has a lot of greenery and a lake. Then there’s the bright spot in the northwest, which is O’Hare International Airport.

Residential zoning dominates outside the downtown zone, and the business/commercial zones spatially overlap with the residential ones. Note that business/commercial and residential uses do exist in the city center, but those areas were allotted to the downtown zoning. Parks are scattered throughout Chicago, while industrial zones sit mostly on the periphery or appear in spots farther inland.

The KDE burglary map shows the burglary value per square meter and has a spatial pattern similar to Part I’s burglary EDA visuals.
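
As a side note, the estimator behind the map can be reproduced in miniature with MASS::kde2d() (MASS ships with R); the points below are simulated and purely illustrative.

```r
# Toy KDE with MASS::kde2d() (MASS ships with R); the simulated points only
# illustrate the estimator behind the map above.
library(MASS)

set.seed(42)
x_pts <- rnorm(500) # Simulated eastings.
y_pts <- rnorm(500) # Simulated northings.

# 2D kernel density estimate evaluated on a 100 x 100 grid.
kde <- kde2d(x_pts, y_pts, n = 100)

dim(kde$z) # 100 x 100 grid of density values (events per unit area).
```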

PART III: Spatial Features

1. Nearest Neighbor: CTA L Stations

# Calculate mean distance to 3 nearest CTA L stations.

# Get coordinates.
fishnet_station_coords <- st_coordinates(st_centroid(fishnet))
station_coords <- st_coordinates(station) %>%
  na.omit() # Drop stations with missing coordinates.

# Calculate k nearest neighbors and distances (get.knnx() is from the FNN package).
nn_station_result <- get.knnx(station_coords, fishnet_station_coords, k = 3)

# Add to fishnet.
fishnet <- fishnet %>%
  mutate(
    nn_station = rowMeans(nn_station_result$nn.dist)
  )

2. Hotspot Distance: Single Street Light Outages

# Function to calculate Local Moran's I.
calculate_local_morans <- function(data, variable, k = 5) {
  
  # Create spatial weights.
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
  
  # Calculate Local Moran's I.
  local_moran <- localmoran(data[[variable]], weights)
  
  # Classify clusters.
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}
# Apply to single street light outages.
fishnet <- calculate_local_morans(fishnet, "count_outages", k = 5)

fishnet <- fishnet %>%
  rename(local_i_outages = local_i,
         p_value_outages = p_value,
         moran_outages = moran_class,
         is_significant_outages = is_significant
         )
# Street lights.
# Visualize hot spots.
light_lisa_map <- ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_outages), 
    color = NA
    ) +
  geom_sf(data = chicago_boundary,
          fill = NA,
          color = "gray60",
          linewidth = 0.75
          ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
      ),
    name = "Cluster Type"
    ) +
  labs(
    title = "Local Moran's I: Single Street Light Outages"
    ) +
  theme_chi_map() +
  theme(
    legend.position = "right",
    legend.text = element_text(hjust = 0)
    )

light_lisa_map

# Street lights.
# Get centroids of "High-High" cells (hot spots).
outage_hotspots <- fishnet %>%
  filter(moran_outages == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot.
if (nrow(outage_hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot_outages = as.numeric(
        st_distance(st_centroid(fishnet), outage_hotspots %>% st_union())
      )
    )
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot_outages = NA_real_) # NA is safer than a misleading 0.
  cat("\u26A0 No significant hot spots found.\n")
}

3. Join Districts and Neighborhoods for Cross-Validation

# Join district information to fishnet.
fishnet <- st_join(
  fishnet,
  police_districts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(dist_label)) # Remove cells outside districts.

# Join neighborhood information to fishnet.
fishnet <- st_join(
  fishnet,
  neighborhoods,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(community)) # Remove cells outside neighborhoods.

4. Section Analysis

Part III is where the spatial features were engineered into predictor variables for the model. These spatial predictors are more sophisticated than raw counts of urban data. A k-nearest-neighbor feature was created from the stations with k = 3, a good balance for a large city whose public transit system is more robust and expansive than most. The mean distance to the three nearest stations is more reflective of everyday life than the distance to the single nearest station, since different stations serve different routes for different individuals.
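
The mean-of-three-nearest computation can be sketched in base R on toy coordinates; the real computation uses FNN::get.knnx() on the fishnet centroids.

```r
# Toy sketch of the mean-of-k-nearest feature; coordinates are made up, and
# the real computation uses FNN::get.knnx() on the fishnet centroids.
knn_mean_dist <- function(from, to, k = 3) {
  apply(from, 1, function(pt) {
    d <- sqrt((to[, 1] - pt[1])^2 + (to[, 2] - pt[2])^2)
    mean(sort(d)[seq_len(k)]) # Mean distance to the k nearest points.
  })
}

cells    <- rbind(c(0, 0), c(1000, 1000))   # Two toy cell centroids (m).
stations <- rbind(c(0, 100), c(0, 200),
                  c(0, 300), c(5000, 5000)) # Four toy stations (m).

nn_station_toy <- knn_mean_dist(cells, stations, k = 3)
nn_station_toy # First cell: mean(100, 200, 300) = 200.
```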

Hotspot distances were also calculated from the outages. This, too, reflects everyday life: individuals may feel less safe in areas with more light outages, and burglars may find more opportunity there.
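
The hotspot-distance feature reduces to a nearest-point query; here is a base-R sketch with made-up coordinates (the real workflow takes st_distance() from each cell centroid to the union of High-High centroids).

```r
# Toy sketch of distance-to-nearest-hotspot; the real workflow takes
# st_distance() from each cell centroid to the union of High-High centroids.
dist_to_nearest <- function(from, hotspots) {
  apply(from, 1, function(pt) {
    min(sqrt((hotspots[, 1] - pt[1])^2 + (hotspots[, 2] - pt[2])^2))
  })
}

cells    <- rbind(c(0, 0), c(2000, 0))    # Two toy cell centroids (m).
hotspots <- rbind(c(500, 0), c(3000, 0))  # Two toy High-High centroids (m).

hotspot_dist <- dist_to_nearest(cells, hotspots)
hotspot_dist # 500 and 1000 meters.
```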

The LISA map shows high-high and low-low clusters and low-high outliers, but no high-low outliers. The hot clusters and scattered outliers sprawl across the Northwest Side, West Side, South Side, and Southwest Side. Cold clusters sit in the Far Southeast Side around Big Marsh Park and on the far northwest edge at O’Hare.

This section also joined the neighborhoods and police districts into the fishnet; the reasoning is explained in Part V-3.

PART IV: Count Regression Models

1. Poisson Regression

# Create clean modeling dataset.
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    dist_label,
    community,
    count_burglaries,
    count_vacant,
    dist_to_hotspot_outages,
    nn_station,
    pct_Business_Commercial,
    pct_Downtown,
    pct_Industrial,
    pct_Park_Open_Space,
    pct_Residential,
    dist_to_streets,
    avg_building_age,
    building_density
  ) %>%
  na.omit() # Remove any remaining NAs.
# Create summary statistics.
summary_stats <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    count_burglaries, 
    dist_to_hotspot_outages, 
    count_vacant, 
    nn_station, 
    dist_to_streets, 
    building_density, 
    avg_building_age, 
    pct_Business_Commercial, 
    pct_Downtown, 
    pct_Industrial, 
    pct_Park_Open_Space
  ) %>%
  summarise(across(everything(), 
                   list(
                     Min = ~min(., na.rm = TRUE),
                     Q1 = ~quantile(., 0.25, na.rm = TRUE),
                     Median = ~median(., na.rm = TRUE),
                     Mean = ~mean(., na.rm = TRUE),
                     Q3 = ~quantile(., 0.75, na.rm = TRUE),
                     Max = ~max(., na.rm = TRUE)
                   ))) %>%
  pivot_longer(
    everything(), 
    names_to = c("Variable", "Statistic"),
    names_pattern = "(.*)_(Min|Q1|Median|Mean|Q3|Max)$" # Anchor the suffix so variable names containing underscores split correctly.
  ) %>%
  pivot_wider(names_from = Statistic, values_from = value)

# Rename to make the table more neat and professional. Mark percentages for nicer value labeling.
summary_stats <- summary_stats %>%
  mutate(
    # Flag percent variables to format the values later.
    is_percent = grepl("^pct_", Variable) | Variable == "building_density",
    Variable = case_when(
      Variable == "count_burglaries" ~ "Burglary Count",
      Variable == "dist_to_hotspot_outages" ~ "Distance to Outage Hotspots (Meters)",
      Variable == "count_vacant" ~ "Vacancy Count",
      Variable == "nn_station" ~ "Distance to Nearest Station (Meters)",
      Variable == "dist_to_streets" ~ "Distance to Major Street (Meters)",
      Variable == "building_density" ~ "Building Density (Percent %)",
      Variable == "avg_building_age" ~ "Average Building Age",
      Variable == "pct_Business_Commercial" ~ "% Business/Commercial",
      Variable == "pct_Downtown" ~ "% Downtown",
      Variable == "pct_Industrial" ~ "% Industrial",
      Variable == "pct_Park_Open_Space" ~ "% Park/Open Space",
      TRUE ~ Variable
    )
  )

# Format the values in the table.
summary_stats <- summary_stats %>%
  mutate(
    is_distance = grepl("Distance", Variable),  # Flag distance variables
    across(c(Min, Q1, Median, Mean, Q3, Max), 
           ~case_when(
             is_percent ~ paste0(format(round(. * 100, 2), big.mark = ","), "%"),
             is_distance ~ paste0(format(round(., 2), big.mark = ","), "m"),
             TRUE ~ format(round(., 2), big.mark = ",")
           ))
  ) %>%
  dplyr::select(-is_percent, -is_distance) # Remove the flag columns so they don't show up in the Kable.

# Create Kable.
summary_stats %>%
  mutate(Variable = cell_spec(Variable, bold = TRUE)) %>%
  kable(
    format = "html", # HTML formatting for visualization flexibility.
    escape = FALSE,
    caption = "Summary Statistics of Final Variables",
    col.names = c("Variable", "Minimum", "1st Quartile", "Median", "Mean", "3rd Quartile", "Maximum"),
    align = c("l", rep("c", 6)) # Left-align left column, repeat center-align for next 6 columns.
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"), # Make the large Kable visually compact.
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#21918c", align = "c")
Summary Statistics of Final Variables
Variable Minimum 1st Quartile Median Mean 3rd Quartile Maximum
Burglary Count 0.00 0.00 2.00 3.41 5.00 34.00
Distance to Outage Hotspots (Meters) 0.00m 500.00m 1,414.21m 2,322.97m 3,041.38m 13,124.40m
Vacancy Count 0.00 0.00 0.00 2.33 1.00 68.00
Distance to Nearest Station (Meters) 147.21m 1,339.27m 2,591.07m 3,349.37m 4,702.88m 12,296.05m
Distance to Major Street (Meters) 0.00m 0.00m 0.00m 61.42m 0.00m 2,264.87m
Building Density (Percent %) 0.0% 6.89% 14.43% 12.94% 18.94% 56.03%
Average Building Age 0.00 62.62 82.97 73.48 100.75 127.74
% Business/Commercial 0.0% 0.00% 3.24% 8.26% 14.81% 50.89%
% Downtown 0.0% 0.00% 0.00% 0.79% 0.00% 77.77%
% Industrial 0.0% 0.00% 0.00% 4.27% 0.60% 100.00%
% Park/Open Space 0.0% 0.00% 0.00% 5.58% 2.04% 100.00%
# Fit Poisson regression.
model_poisson <- glm(count_burglaries ~
                       # 311 data.
                       dist_to_hotspot_outages +
                       # Contextual count data.
                       count_vacant +
                       # Contextual spatial features.
                       nn_station +  dist_to_streets + building_density +
                       # Age polynomial.
                       avg_building_age + I(avg_building_age^2) +
                       # Zoning. Leave residential out as reference.
                       pct_Business_Commercial + pct_Downtown + pct_Industrial + pct_Park_Open_Space,
                     data = fishnet_model,
                     family = "poisson"
                     )

# Summary.
summary(model_poisson)

Call:
glm(formula = count_burglaries ~ dist_to_hotspot_outages + count_vacant + 
    nn_station + dist_to_streets + building_density + avg_building_age + 
    I(avg_building_age^2) + pct_Business_Commercial + pct_Downtown + 
    pct_Industrial + pct_Park_Open_Space, family = "poisson", 
    data = fishnet_model)

Coefficients:
                            Estimate   Std. Error z value             Pr(>|z|)
(Intercept)             -3.073808634  0.333799112  -9.209 < 0.0000000000000002
dist_to_hotspot_outages -0.000180071  0.000013223 -13.618 < 0.0000000000000002
count_vacant             0.008573989  0.001454178   5.896     0.00000000372181
nn_station              -0.000002367  0.000009780  -0.242             0.808775
dist_to_streets         -0.001445527  0.000647913  -2.231             0.025678
building_density         2.728651596  0.289326230   9.431 < 0.0000000000000002
avg_building_age         0.087348829  0.007560226  11.554 < 0.0000000000000002
I(avg_building_age^2)   -0.000437146  0.000042510 -10.283 < 0.0000000000000002
pct_Business_Commercial  1.039951228  0.146905295   7.079     0.00000000000145
pct_Downtown             0.725126200  0.182070394   3.983     0.00006814549675
pct_Industrial          -0.973508683  0.192591788  -5.055     0.00000043089228
pct_Park_Open_Space     -0.754484807  0.204272772  -3.694             0.000221
                           
(Intercept)             ***
dist_to_hotspot_outages ***
count_vacant            ***
nn_station                 
dist_to_streets         *  
building_density        ***
avg_building_age        ***
I(avg_building_age^2)   ***
pct_Business_Commercial ***
pct_Downtown            ***
pct_Industrial          ***
pct_Park_Open_Space     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5617.1  on 1276  degrees of freedom
Residual deviance: 2925.3  on 1265  degrees of freedom
AIC: 5772.5

Number of Fisher Scoring iterations: 6

2. Check for Overdispersion

# Calculate dispersion parameter.
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual
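
The same Pearson-dispersion check can be demonstrated end to end on simulated overdispersed counts (toy data, not the fishnet):

```r
# Toy demonstration of the same check: simulate overdispersed counts, fit a
# Poisson model, and compute the Pearson dispersion statistic.
set.seed(42)
x_toy <- runif(500)
y_toy <- rnbinom(500, size = 2, mu = exp(0.5 + 1.5 * x_toy)) # Overdispersed.

toy_poisson <- glm(y_toy ~ x_toy, family = "poisson")

# Pearson chi-square over residual df; approximately 1 under a true Poisson.
toy_dispersion <- sum(residuals(toy_poisson, type = "pearson")^2) /
  toy_poisson$df.residual
toy_dispersion # Noticeably above 1, flagging overdispersion.
```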

3. Negative Binomial Regression

# Fit negative binomial model (glm.nb() is from the MASS package).
model_nb <- glm.nb(count_burglaries ~
                     # 311 data.
                     dist_to_hotspot_outages +
                     # Contextual count data.
                     count_vacant +
                     # Contextual spatial features.
                     dist_to_streets + building_density +
                     # Age polynomial.
                     avg_building_age + I(avg_building_age^2) +
                     # Zoning. Leave residential out as reference.
                     pct_Business_Commercial + pct_Downtown + pct_Industrial + pct_Park_Open_Space,
                     data = fishnet_model
                   )

# Summary
summary(model_nb)

Call:
glm.nb(formula = count_burglaries ~ dist_to_hotspot_outages + 
    count_vacant + dist_to_streets + building_density + avg_building_age + 
    I(avg_building_age^2) + pct_Business_Commercial + pct_Downtown + 
    pct_Industrial + pct_Park_Open_Space, data = fishnet_model, 
    init.theta = 2.252719149, link = log)

Coefficients:
                           Estimate  Std. Error z value             Pr(>|z|)
(Intercept)             -2.93989971  0.40976071  -7.175  0.00000000000072479
dist_to_hotspot_outages -0.00016227  0.00001978  -8.205  0.00000000000000023
count_vacant             0.01088869  0.00305940   3.559             0.000372
dist_to_streets         -0.00191060  0.00101168  -1.889             0.058953
building_density         3.22875110  0.47301016   6.826  0.00000000000873359
avg_building_age         0.08027292  0.00965518   8.314 < 0.0000000000000002
I(avg_building_age^2)   -0.00039559  0.00005593  -7.073  0.00000000000151664
pct_Business_Commercial  1.24510516  0.26453601   4.707  0.00000251695918053
pct_Downtown             0.84752237  0.34966093   2.424             0.015357
pct_Industrial          -0.81280466  0.26579963  -3.058             0.002228
pct_Park_Open_Space     -0.55988433  0.28484747  -1.966             0.049350
                           
(Intercept)             ***
dist_to_hotspot_outages ***
count_vacant            ***
dist_to_streets         .  
building_density        ***
avg_building_age        ***
I(avg_building_age^2)   ***
pct_Business_Commercial ***
pct_Downtown            *  
pct_Industrial          ** 
pct_Park_Open_Space     *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.2527) family taken to be 1)

    Null deviance: 2584.1  on 1276  degrees of freedom
Residual deviance: 1254.4  on 1266  degrees of freedom
AIC: 5055.3

Number of Fisher Scoring iterations: 1

              Theta:  2.253 
          Std. Err.:  0.172 

 2 x log-likelihood:  -5031.330 
# Create new dataframe for Kable.
model_comparison_stats <- tibble(
  Statistic = c(
    "Poisson Model AIC",
    "Negative Binomial AIC",
    "Poisson Dispersion"
  ),
  Value = c(
    round(AIC(model_poisson), 2),
    round(AIC(model_nb), 2),
    round(dispersion, 2)
  )
)

# Format so that there are thousand separators.
model_comparison_stats <- model_comparison_stats %>%
  mutate(
    Value = prettyNum(Value, big.mark = ",")
  )

# Create Kable for AIC comparison.
model_comparison_stats %>%
  mutate(
    Statistic = cell_spec(Statistic, bold = TRUE)
  ) %>%
  kable(
    format = "html", # HTML formatting for visualization flexibility.
    escape = FALSE,
    caption = "Akaike Information Criteria (AIC) and Dispersion",
    align = c("l", "c"),
    row.names = FALSE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"), # Make the Kable visually compact.
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#21918c", align = "c") %>%
  footnote(
    general = "Dispersion > 1.5 suggests overdispersion.\nLower AIC is better.",
    general_title = "Note:"
  )
Akaike Information Criteria (AIC) and Dispersion
Statistic Value
Poisson Model AIC 5,772.46
Negative Binomial AIC 5,055.33
Poisson Dispersion 2.72
Note:
Dispersion > 1.5 suggests overdispersion.
Lower AIC is better.

4. VIF Test

# VIF test for the Negative Binomial model.
vif_results <- vif(model_nb)

# VIF to a data frame.
vif_table <- data.frame(
  Variable = names(vif_results),
  VIF = vif_results
) %>%
  arrange(desc(VIF)) %>% # Descending sort VIF values.
  mutate(
    Concern = case_when(
      VIF > 10 ~ "High",
      VIF > 5 ~ "Moderate",
      TRUE ~ "Low"
    )
  )

# Rename to make the table more neat and professional.
vif_table <- vif_table %>%
  mutate(Variable = case_when(
    Variable == "count_vacant" ~ "Vacancy Count",
    Variable == "dist_to_streets" ~ "Distance to Major Street",
    Variable == "building_density" ~ "Building Density",
    Variable == "I(avg_building_age^2)" ~ "Average Building Age²",
    Variable == "avg_building_age" ~ "Average Building Age",
    Variable == "pct_Business_Commercial" ~ "% Business/Commercial",
    Variable == "pct_Downtown" ~ "% Downtown",
    Variable == "pct_Industrial" ~ "% Industrial",
    Variable == "pct_Park_Open_Space" ~ "% Park/Open Space",
    Variable == "dist_to_hotspot_outages" ~ "Distance to Outage Hotspots",
    TRUE ~ Variable
    )
  )

# Create the Kable.
vif_table %>%
  mutate(Variable = cell_spec(Variable, bold = TRUE)) %>%
  kable(
    format = "html", # HTML formatting for visualization flexibility.
    escape = FALSE,
    digits = 2,
    caption = "Variance Inflation Factors (VIF)",
    col.names = c("Variable", "VIF", "Multicollinearity"),
    align = c("l", "c", "c"),
    row.names = FALSE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"), # Make the large Kable visually compact.
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#21918c", align = "c") %>%
  footnote(general = "VIF > 10 indicates high multicollinearity.\nVIF > 5 indicates moderate concern.\nAge polynomials are mathematically related and negligible.",
           general_title = "Note:")
Variance Inflation Factors (VIF)
Variable VIF Multicollinearity
Average Building Age 46.10 High
Average Building Age² 46.05 High
% Business/Commercial 1.21 Low
Building Density 1.21 Low
% Park/Open Space 1.11 Low
Distance to Outage Hotspots 1.09 Low
Vacancy Count 1.07 Low
Distance to Major Street 1.06 Low
% Downtown 1.06 Low
% Industrial 1.05 Low
Note:
VIF > 10 indicates high multicollinearity.
VIF > 5 indicates moderate concern.
Age polynomials are mathematically related and negligible.

5. Section Analysis

Part IV built the statistical count models, Poisson and negative binomial, which are designed for count response variables such as the one here. Poisson was fit first, and after dispersion proved to be an issue, the negative binomial approach was pursued, much as detecting spatial dependence with Moran’s I on OLS residuals motivates alternative spatial regressions.

The first table exhibits the summary statistics of the final variables used for the regression models.

The more important aspect of the section, however, is the interpretation of the models. The following discusses the Poisson output:

  • Being farther from an outage hotspot is associated with fewer burglaries.

  • More vacant buildings are associated with more burglaries.

  • The mean distance to the three nearest stations is highly insignificant, though its sign weakly associates station proximity with more burglaries.

  • Being near a major street is also associated with more burglaries, and significantly so, unlike the stations; this may be because the major streets already capture what the stations would capture in terms of access and escape routes for burglars.

  • Higher building density is associated with more burglaries and is among the most significant predictors, likely because denser areas offer both more potential targets and a greater variety of them.

  • The linear age term is positive while the squared term is negative, so burglary risk rises with building age only up to a point. Older buildings may have weaker security or structural wear that makes them easier to enter, but beyond a certain age the risk declines; historical buildings, for example, may be targeted less because they are under preservation and tourist scrutiny.

  • Industrial and park zonings are less of targets, which is unsurprising because they likely don’t have anything of value to burglars, unlike residential, downtown, and business/commercial zonings.

  • Note that residential zoning is the omitted reference category, so the coefficients imply residential areas are less likely burglary targets than business/commercial and downtown zoning, but more likely than parks and industrial zoning.
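
The turning point implied by the age polynomial can be checked directly from the printed Poisson coefficients: the age effect peaks where the derivative of b1·age + b2·age² is zero, i.e., at age = -b1 / (2·b2).

```r
# Checking the turning point implied by the printed Poisson coefficients:
# the age effect b1*age + b2*age^2 peaks at age = -b1 / (2 * b2).
b1 <- 0.087348829  # avg_building_age coefficient from the output above.
b2 <- -0.000437146 # I(avg_building_age^2) coefficient.

peak_age <- -b1 / (2 * b2)
peak_age # Roughly 100 years, near the 3rd quartile of average building age.
```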

However, dispersion was greater than 1.5, meaning a Poisson assumption was violated and p-value significance was inflated. A negative binomial model was therefore pursued, and because the nn_station variable was so insignificant in the Poisson model, it was dropped from the negative binomial.

As for the negative binomial output, the relationship types remained the same and no signs flipped, so the general observations from the Poisson output above can be applied to the negative binomial output.

In the end, the difference in AIC was over 700, indicating that the negative binomial fit vastly better than the Poisson.

A VIF test was also conducted to check for multicollinearity; aside from the mathematically related age polynomial terms, there was none, reinforcing the reliability of the predictors and coefficients.

PART V: Spatial Cross-Validation (2017)

1. Leave-One-Group-Out (LOGO) Cross-Validation

# Get unique districts.
districts_unique <- unique(fishnet_model$dist_label)
cv_results_districts <- tibble()

for (i in seq_along(districts_unique)) {
  
  test_districts <- districts_unique[i]
  
  # Split data.
  train_data <- fishnet_model %>% filter(dist_label != test_districts)
  test_data <- fishnet_model %>% filter(dist_label == test_districts)
  
  # Fit model on training data.
  model_cv <- glm.nb(count_burglaries ~
                       # 311 data.
                       dist_to_hotspot_outages +
                       # Contextual count data.
                       count_vacant +
                       # Contextual spatial features.
                       dist_to_streets + building_density +
                       # Age polynomial.
                       avg_building_age + I(avg_building_age^2) +
                       # Zoning. Leave residential out as reference.
                       pct_Business_Commercial + pct_Downtown + pct_Industrial + pct_Park_Open_Space,
                     data = train_data
                     )
  
  # Predict on test data.
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics.
  mae <- mean(abs(test_data$count_burglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$count_burglaries - test_data$prediction)^2))
  
  # Store results.
  cv_results_districts <- bind_rows( # Append this fold's metrics to the results.
    cv_results_districts,
    tibble(
      test_district = test_districts,
      mae = mae,
      rmse = rmse
    )
  )
}
# Get unique neighborhoods
neighborhoods_unique <- unique(fishnet_model$community)
cv_results_neighborhoods <- tibble()

for (i in seq_along(neighborhoods_unique)) {
  
  test_neighborhoods <- neighborhoods_unique[i]
  
  # Split data.
  train_data <- fishnet_model %>% filter(community != test_neighborhoods)
  test_data <- fishnet_model %>% filter(community == test_neighborhoods)
  
  # Fit model on training data.
  model_cv <- glm.nb(count_burglaries ~
                       # 311 data.
                       dist_to_hotspot_outages +
                       # Contextual count data.
                       count_vacant +
                       # Contextual spatial features.
                       dist_to_streets + building_density +
                       # Age polynomial.
                       avg_building_age + I(avg_building_age^2) +
                       # Zoning. Leave residential out as reference.
                       pct_Business_Commercial + pct_Downtown + pct_Industrial + pct_Park_Open_Space,
                     data = train_data
                     )
  
  # Predict on test data.
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics.
  mae <- mean(abs(test_data$count_burglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$count_burglaries - test_data$prediction)^2))
  
  # Store results.
  cv_results_neighborhoods <- bind_rows(
    cv_results_neighborhoods,
    tibble(
      test_neighborhood = test_neighborhoods,
      mae = mae,
      rmse = rmse
    )
  )
}
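
The two loops above share one pattern, which can be condensed on toy data (a hypothetical `group` column stands in for districts or neighborhoods):

```r
# Toy LOGO sketch: a hypothetical `group` column stands in for districts or
# neighborhoods; each group is held out once and scored with RMSE.
set.seed(42)
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 50),
  x = runif(150)
)
toy$y <- rpois(150, lambda = exp(0.2 + 1.0 * toy$x))

logo_rmse <- sapply(unique(toy$group), function(g) {
  fit <- glm(y ~ x, family = "poisson", data = subset(toy, group != g))
  held <- subset(toy, group == g)
  pred <- predict(fit, held, type = "response")
  sqrt(mean((held$y - pred)^2)) # RMSE on the held-out group.
})

logo_rmse # One RMSE per left-out group.
```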

2. Cross-Validation Tables

# Scrollable Kable for districts.
cv_results_districts %>%
  arrange(desc(rmse)) %>%
  mutate(
    test_district = cell_spec(test_district, bold = TRUE),
    mae = round(mae, 2),
    rmse = round(rmse, 2)
  ) %>%
  kable(
    format = "html", # HTML formatting for visualization flexibility.
    escape = FALSE,
    caption = "Cross-Validation Results by Police Districts",
    col.names = c("Test District", "MAE", "RMSE"),
    align = c("l", "c", "c"),
    row.names = FALSE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"), # Make the large Kable visually compact.
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#21918c", align = "c") %>%
  scroll_box(height = "500px", width = "500px") %>%  # Make it scrollable.
  footnote(
    general = paste0("Mean MAE: ", round(mean(cv_results_districts$mae), 2),
                    "\nMean RMSE: ", round(mean(cv_results_districts$rmse), 2)),
    general_title = "Overall Performance:"
  )
Cross-Validation Results by Police Districts
Test District MAE RMSE
3RD 5.05 7.00
1ST 3.78 4.91
12TH 3.26 4.60
2ND 3.21 4.41
9TH 3.10 4.24
24TH 2.49 3.96
18TH 3.24 3.96
14TH 2.86 3.91
6TH 2.93 3.77
19TH 3.09 3.74
10TH 2.75 3.64
8TH 2.38 3.48
4TH 1.28 3.31
11TH 2.48 3.30
25TH 2.42 3.04
7TH 2.24 3.00
5TH 1.76 2.61
22ND 2.05 2.54
17TH 2.04 2.49
15TH 1.86 2.46
20TH 1.63 1.96
16TH 1.14 1.86
Overall Performance:
Mean MAE: 2.59
Mean RMSE: 3.55
# Scrollable Kable for neighborhoods
cv_results_neighborhoods %>%
  arrange(desc(rmse)) %>%
  mutate(
    test_neighborhood = cell_spec(test_neighborhood, bold = TRUE),
    mae = round(mae, 2),
    rmse = round(rmse, 2)
  ) %>%
  kable(
    format = "html", # HTML formatting for visualization flexibility.
    escape = FALSE,
    caption = "Cross-Validation Results by Neighborhoods",
    col.names = c("Neighborhood", "MAE", "RMSE"),
    align = c("l", "c", "c"),
    row.names = FALSE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"), # Make the large Kable visually compact.
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#21918c", align = "c") %>%
  scroll_box(height = "500px", width = "500px") %>%  # Make it scrollable.
  footnote(
    general = paste0("Mean MAE: ", round(mean(cv_results_neighborhoods$mae), 2),
                    "\nMean RMSE: ", round(mean(cv_results_neighborhoods$rmse), 2)),
    general_title = "Overall Performance:"
  )
Cross-Validation Results by Neighborhoods
Neighborhood MAE RMSE
SOUTH SHORE 8.43 11.24
WOODLAWN 7.28 9.11
WEST TOWN 6.50 7.87
WASHINGTON PARK 5.22 6.86
CHICAGO LAWN 4.67 5.73
NEW CITY 3.95 5.22
LOOP 4.16 5.20
SOUTH CHICAGO 3.36 5.08
HYDE PARK 3.32 5.01
GAGE PARK 3.53 4.64
NEAR NORTH SIDE 3.85 4.63
WEST ELSDON 4.08 4.61
NORTH CENTER 3.70 4.10
HERMOSA 3.35 4.06
GRAND BOULEVARD 3.33 4.05
NORTH LAWNDALE 3.00 4.00
AUBURN GRESHAM 3.08 4.00
WEST RIDGE 2.17 3.99
ROSELAND 3.01 3.91
HUMBOLDT PARK 3.42 3.85
ROGERS PARK 3.02 3.79
LAKE VIEW 3.09 3.78
AVONDALE 3.22 3.61
WEST ENGLEWOOD 2.73 3.59
CHATHAM 2.98 3.58
BEVERLY 3.40 3.57
PORTAGE PARK 3.25 3.53
BRIGHTON PARK 2.58 3.50
SOUTH LAWNDALE 2.65 3.49
EAST GARFIELD PARK 2.44 3.40
GREATER GRAND CROSSING 1.96 3.25
ARCHER HEIGHTS 2.54 3.25
MCKINLEY PARK 2.46 3.04
IRVING PARK 2.67 2.97
ASHBURN 1.85 2.93
BELMONT CRAGIN 2.34 2.91
NEAR SOUTH SIDE 2.39 2.90
ALBANY PARK 2.39 2.89
CALUMET HEIGHTS 2.22 2.86
LOWER WEST SIDE 2.27 2.85
JEFFERSON PARK 2.32 2.73
KENWOOD 2.47 2.65
AUSTIN 1.94 2.60
NEAR WEST SIDE 2.11 2.54
LINCOLN PARK 2.19 2.44
WASHINGTON HEIGHTS 1.94 2.44
MONTCLARE 1.95 2.36
ENGLEWOOD 1.81 2.35
AVALON PARK 1.96 2.28
PULLMAN 1.41 2.25
BURNSIDE 2.23 2.23
LOGAN SQUARE 1.82 2.23
WEST PULLMAN 1.78 2.15
DOUGLAS 1.73 2.07
MORGAN PARK 1.58 2.00
GARFIELD RIDGE 1.64 1.97
UPTOWN 1.56 1.96
DUNNING 1.44 1.92
WEST LAWN 1.39 1.91
BRIDGEPORT 1.49 1.88
LINCOLN SQUARE 1.47 1.83
EAST SIDE 1.34 1.80
NORWOOD PARK 1.53 1.78
EDGEWATER 1.40 1.72
CLEARING 1.27 1.61
MOUNT GREENWOOD 1.16 1.37
EDISON PARK 1.34 1.36
ARMOUR SQUARE 1.30 1.30
HEGEWISCH 0.59 1.24
FOREST GLEN 0.93 1.08
NORTH PARK 0.83 1.08
RIVERDALE 0.44 0.81
SOUTH DEERING 0.39 0.73
WEST GARFIELD PARK 0.41 0.54
OAKLAND 0.17 0.17
OHARE 0.05 0.15
Overall Performance:
Mean MAE: 2.46
Mean RMSE: 3.11

Police Districts (From CPD)
# Join RMSE values to districts.
districts_rmse <- police_districts %>%
  left_join(cv_results_districts,
            by = c("dist_label" = "test_district"))

# Join RMSE values to neighborhoods.
neighborhoods_rmse <- neighborhoods %>%
  left_join(cv_results_neighborhoods,
            by = c("community" = "test_neighborhood"))

# Create custom color ramp.
weitzman <- colorRampPalette(c("#778ac5", "#ff4100"))

# District max and min RMSE for labeling.
district_max <- districts_rmse %>%
  slice_max(rmse, n = 1)
district_min <- districts_rmse %>%
  slice_min(rmse, n = 1)

# Neighborhood max and min RMSE for labeling.
neighborhood_max <- neighborhoods_rmse %>%
  slice_max(rmse, n = 1)
neighborhood_min <- neighborhoods_rmse %>%
  slice_min(rmse, n = 1)

# Police district RMSE.
rmse_districts <- ggplot(districts_rmse) +
  geom_sf(aes(fill = rmse),
          color = "#f5f4f0",
          linewidth = 0.75
          ) +
  scale_fill_gradientn(
    colors = weitzman(256),
    name = "Police District RMSE",
    trans = "sqrt",
    breaks = c(1.88, 2.79, 3.55, 3.98, 5.00, 6.99)
    ) +
  geom_sf_label(
    data = district_max,
    aes(label = paste0(dist_label, ": ", round(rmse, 2))),
    fill = "#2d2a26", color = "#f5f4f0", size = 3, label.size = 0,
    fontface = "bold"
    ) +
  geom_sf_label(
    data = district_min,
    aes(label = paste0(dist_label, ": ", round(rmse, 2))),
    fill = "#2d2a26", color = "#f5f4f0", size = 3, label.size = 0,
    fontface = "bold"
    ) +
  labs(
    title = "Police Districts",
    subtitle = "Chicago, Illinois | Min and Max Districts"
    ) +
  theme_chi_map() +
  theme(
    plot.background  = element_rect(fill = "#f5f4f0", color = NA),
    panel.background = element_rect(fill = "#f5f4f0", color = NA)
  )

# Neighborhood RMSE.
rmse_neighborhoods <- ggplot(neighborhoods_rmse) +
  geom_sf(aes(fill = rmse),
          color = "#f5f4f0",
          linewidth = 0.75
          ) +
  scale_fill_gradientn(
    colors = weitzman(256),
    name = "Neighborhood RMSE",
    trans = "sqrt",
    breaks = c(0.13, 1.93, 2.86, 3.90, 6.00, 11.40)
    ) +
  geom_sf_label(
    data = neighborhood_max,
    aes(label = paste0(community, ": ", round(rmse, 2))),
    fill = "#2d2a26", color = "#f5f4f0", size = 3, label.size = 0,
    fontface = "bold"
  ) +
  geom_sf_label(
    data = neighborhood_min,
    aes(label = paste0(community, ": ", round(rmse, 2))),
    fill = "#2d2a26", color = "#f5f4f0", size = 3, label.size = 0,
    fontface = "bold"
  ) +
  labs(
    title = "Neighborhoods",
    subtitle = "Chicago, Illinois | Min and Max Neighborhoods"
    ) +
  theme_chi_map() +
  theme(
    plot.background  = element_rect(fill = "#f5f4f0", color = NA),
    panel.background = element_rect(fill = "#f5f4f0", color = NA)
  )

# Combine plots using patchwork.
rmse_districts + rmse_neighborhoods +
  plot_annotation(
    title = "Spatial Distribution of LOGO Cross-Validation",
    subtitle = "Root Mean Squared Error (RMSE)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        ),
      plot.background  = element_rect(fill = "#f5f4f0", color = NA),
      panel.background = element_rect(fill = "#f5f4f0", color = NA)
      )
    )

3. Section Analysis

To explain why, in Part III-4, both the police districts and the neighborhoods were joined to the fishnet: the goal, again, is to make this predictive model as “ethical” as possible. Even though the folds serve a largely technical purpose, cross-validating over police districts implies that the model is geared toward policing, whereas cross-validating over neighborhoods carries more implications for community-building and resource/service allocation.

Cross-validation on the 2017 data was used to evaluate how robust the model is, using a Leave-One-Group-Out (LOGO) method instead of classic k-fold cross-validation. The procedure holds out one district or neighborhood, trains on all the others, and then predicts the held-out group to compute the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
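The LOGO loop described above can be sketched as follows. This is a minimal illustration, not the exact code used earlier: it assumes `fishnet_model` carries a `community` grouping column, and the predictor formula is abbreviated for clarity.

```r
# Minimal LOGO cross-validation sketch (assumes fishnet_model has a
# `community` grouping column; formula abbreviated for readability).
logo_cv <- purrr::map_dfr(unique(fishnet_model$community), function(group) {
  train <- fishnet_model %>% filter(community != group)  # All other groups.
  test  <- fishnet_model %>% filter(community == group)  # Held-out group.

  # Refit the negative binomial model on the training groups only.
  fit <- MASS::glm.nb(count_burglaries ~ dist_to_hotspot_outages +
                        count_vacant + building_density,
                      data = train)

  # Predict the held-out group and score it.
  pred <- predict(fit, newdata = test, type = "response")

  tibble(
    test_neighborhood = group,
    mae  = mean(abs(test$count_burglaries - pred)),
    rmse = sqrt(mean((test$count_burglaries - pred)^2))
  )
})
```

Each row of `logo_cv` then corresponds to one held-out group, which is what the per-neighborhood MAE/RMSE table above reports.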

Also, according to the CV tables, the three highest RMSEs belong to districts 3, 1, and 12, and to the neighborhoods South Shore, Woodlawn, and West Town. The districts with the highest errors are in the Central, South Side, and West Side areas, and the highest-error neighborhoods are within the West Side and South Side. The biggest overlap is on the South Side, where South Shore, Woodlawn, and the 3rd District coincide around the waterfront.

On the other end of the spectrum, the districts with the lowest RMSEs are 16, 20, and 15, and the lowest-error neighborhoods are O’Hare, Oakland, and West Garfield Park. For districts, these regions are in the Far North Side, where the airport is (which is expected), as well as the West Side. For neighborhoods, the best performers are Oakland on the South Side, West Garfield Park on the West Side, and O’Hare on the Far North Side, as previously explained.

District scale and neighborhood scale also differ in accuracy: this particular model performs noticeably better at the neighborhood level, where predictions are typically 2 to 3 burglaries off, as opposed to roughly 3 burglaries off at the district level.

This could be due to a variety of factors, but performance may differ because neighborhoods are more uniform in their urban characteristics, which is exactly what the predictor variables are based on. The model may not have performed as well for police districts because district boundaries tend to be delineated according to call volume, crime rates, and how many resources police can dispatch to handle them (i.e., drawing police boundaries may not be as strongly influenced by urban form).

PART VI: Model Evaluation

1. Generate Final Predictions

# Fit final model on all data.
final_model <- glm.nb(count_burglaries ~
                        # 311 data.
                        dist_to_hotspot_outages +
                        # Contextual count data.
                        count_vacant +
                        # Contextual spatial features.
                        dist_to_streets + building_density +
                        # Age polynomial.
                        avg_building_age + I(avg_building_age^2) +
                        # Zoning. Leave residential out as reference.
                        pct_Business_Commercial + pct_Downtown + pct_Industrial + pct_Park_Open_Space,
                     data = fishnet_model
                     )

# Add predictions back to fishnet.
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[
      match(uniqueID, fishnet_model$uniqueID)
    ]
  )

# Also add KDE predictions (normalize to same scale as counts).
kde_sum <- sum(fishnet$kde_value_burglaries, na.rm = TRUE)
count_sum <- sum(fishnet$count_burglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value_burglaries / kde_sum) * count_sum
  )
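Because `prediction_kde` rescales the density surface by the ratio of total observed burglaries to total density, the KDE predictions should sum to the citywide 2017 burglary count (up to any cells with missing density). A quick sanity check on the objects created above:

```r
# The rescaled KDE predictions should total the observed burglary count.
all.equal(
  sum(fishnet$prediction_kde, na.rm = TRUE),
  sum(fishnet$count_burglaries, na.rm = TRUE)
)
```

This rescaling is what makes the KDE surface comparable to the count-scale NB predictions in the maps and metrics below.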

2. Compare Model vs. KDE Baseline

# Create three maps.
actual_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = count_burglaries),
          color = NA
          ) +
  scale_fill_viridis_c(name = "Actual and Predicted Count",
                       option = "rocket",
                       direction = -1,
                       limits = c(0, 15)
                       ) +
  labs(title = "Actual Burglaries") +
  theme_chi_map()

model_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = prediction_nb),
          color = NA
          ) +
  scale_fill_viridis_c(name = "Actual and Predicted Count",
                       option = "rocket",
                       direction = -1,
                       limits = c(0, 15)
                       ) +
  labs(title = "Model Predictions (Negative Binomial)") +
  theme_chi_map()

kde_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = prediction_kde),
          color = NA
          ) +
  scale_fill_viridis_c(name = "Actual and Predicted Count",
                       option = "rocket",
                       direction = -1,
                       limits = c(0, 15)
                       ) +
  labs(title = "KDE Baseline Predictions") +
  theme_chi_map()

actual_map + model_map + kde_map +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Performance: Negative Binomial vs. Simple KDE",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
  ) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

# Calculate performance metrics.
comparison_2017 <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    NB_MAE = mean(abs(count_burglaries - prediction_nb)),
    NB_RMSE = sqrt(mean((count_burglaries - prediction_nb)^2)),
    KDE_MAE = mean(abs(count_burglaries - prediction_kde)),
    KDE_RMSE = sqrt(mean((count_burglaries - prediction_kde)^2))
  )

# Create Kable.
comparison_2017 %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>%
  separate(Metric, into = c("Approach", "Metric"), sep = "_") %>%
  pivot_wider(names_from = Metric, values_from = Value) %>%
  kable(
    format = "html", # HTML formatting for visualization flexibility.
    escape = FALSE,
    digits = 2,
    caption = "Model In-Sample Performance Comparison (2017)",
    align = c("l", "c", "c"),
    row.names = FALSE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
    ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#21918c", align = "c") %>%
  column_spec(1, bold = TRUE) %>%
  footnote(
    general = "Lower values are better.",
    general_title = "Note:"
  )
Model In-Sample Performance Comparison (2017)
Approach MAE RMSE
NB 2.11 3.21
KDE 1.98 2.90
Note:
Lower values are better.
# Calculate performance metrics.
comparison_2018 <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    NB_MAE = mean(abs(count_burglaries_2018 - prediction_nb)),
    NB_RMSE = sqrt(mean((count_burglaries_2018 - prediction_nb)^2)),
    KDE_MAE = mean(abs(count_burglaries_2018 - prediction_kde)),
    KDE_RMSE = sqrt(mean((count_burglaries_2018 - prediction_kde)^2))
  )

# Create Kable.
comparison_2018 %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>%
  separate(Metric, into = c("Approach", "Metric"), sep = "_") %>%
  pivot_wider(names_from = Metric, values_from = Value) %>%
  kable(
    format = "html", # HTML formatting for visualization flexibility.
    escape = FALSE,
    digits = 2,
    caption = "Model Out-of-Sample Performance Comparison (2018)",
    align = c("l", "c", "c"),
    row.names = FALSE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
    ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#21918c", align = "c") %>%
  column_spec(1, bold = TRUE) %>%
  footnote(
    general = "Lower values are better.",
    general_title = "Note:"
  )
Model Out-of-Sample Performance Comparison (2018)
Approach MAE RMSE
NB 1.97 2.96
KDE 2.02 2.97
Note:
Lower values are better.

3. Where Does the Model Work Well?

# Calculate errors.
fishnet <- fishnet %>%
  mutate(
    error_nb = count_burglaries - prediction_nb,
    error_kde = count_burglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )
# NB map errors.
nb_error_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = error_nb),
          color = NA
          ) +
  scale_fill_gradient2(
    name = "Error",
    low = "#011f5b", mid = "white", high = "#990000",
    midpoint = 0,
    limits = c(-10, 10),
    breaks = c(-10, -5, 0, 5, 10)
    ) +
  labs(title = "NB Model Errors (Actual – Predicted)") +
  theme_chi_map()

# NB map absolute errors.
nb_abs_error_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = abs_error_nb),
          color = NA
          ) +
  scale_fill_viridis_c(name = "NB Absolute Error",
                       option = "rocket",
                       direction = -1,
                       breaks = c(0, 5, 10, 15, 20)
                       ) +
  labs(title = "NB Model Absolute Errors") +
  theme_chi_map()

# KDE map errors.
kde_error_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = error_kde),
          color = NA
          ) +
  scale_fill_gradient2(
    name = "Error",
    low = "#011f5b", mid = "white", high = "#990000",
    midpoint = 0,
    limits = c(-10, 10),
    breaks = c(-10, -5, 0, 5, 10)
    ) +
  labs(title = "KDE Model Errors (Actual – Predicted)") +
  theme_chi_map()

# KDE map absolute errors.
kde_abs_error_map <- ggplot() +
  geom_sf(data = fishnet,
          aes(fill = abs_error_kde),
          color = NA
          ) +
  scale_fill_viridis_c(name = "KDE Absolute Error",
                       option = "rocket",
                       direction = -1,
                       breaks = c(0, 5, 10, 15, 20)
                       ) +
  labs(title = "KDE Model Absolute Errors") +
  theme_chi_map()

(nb_error_map + nb_abs_error_map) / (kde_error_map + kde_abs_error_map) +
  plot_annotation(
    title = "Model Errors",
    subtitle = "Negative Binomial (NB) vs. Kernel Density Estimation (KDE)",
    theme = theme(
      plot.title = element_text(
        size = 16, 
        face = "bold",
        hjust = 0.5
        ),
      plot.subtitle = element_text(
        size = 14,
        face = "italic",
        hjust = 0.5,
        color = "gray30"
        )
      )
    ) +
  plot_layout(guides = "collect") &
  theme(legend.position = "bottom")

4. Model Summary Table

# Create final summary table.
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

# Rename to make the table more neat and professional.
model_summary <- model_summary %>%
  mutate(term = case_when(
    term == "(Intercept)" ~ "Intercept",
    term == "I(avg_building_age^2)" ~ "Average Building Age²",
    term == "count_vacant" ~ "Vacancy Count",
    term == "dist_to_streets" ~ "Distance to Major Street",
    term == "building_density" ~ "Building Density",
    term == "avg_building_age" ~ "Average Building Age",
    term == "pct_Business_Commercial" ~ "% Business/Commercial",
    term == "pct_Downtown" ~ "% Downtown",
    term == "pct_Industrial" ~ "% Industrial",
    term == "pct_Park_Open_Space" ~ "% Park/Open Space",
    term == "dist_to_hotspot_outages" ~ "Distance to Outage Hotspots",
    TRUE ~ term
    )
  )

model_summary %>%
  kable(
    format = "html", # HTML formatting for visualization flexibility.
    escape = FALSE,
    digits = 2,
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Standard Error", "Z-Score", "P-Value"),
    align = c("l", rep("c", 4)),
    row.names = FALSE
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
    ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#21918c", align = "c") %>%
  column_spec(1, bold = TRUE) %>%
  footnote(
    general = "Rate Ratios > 1 have a positive association with burglary counts.",
    general_title = "Note:"
  )
Final Negative Binomial Model Coefficients (Exponentiated)
Variable Rate Ratio Standard Error Z-Score P-Value
Intercept 0.05 0.41 -7.18 0.00
Distance to Outage Hotspots 1.00 0.00 -8.20 0.00
Vacancy Count 1.01 0.00 3.56 0.00
Distance to Major Street 1.00 0.00 -1.89 0.06
Building Density 25.25 0.47 6.83 0.00
Average Building Age 1.08 0.01 8.31 0.00
Average Building Age² 1.00 0.00 -7.07 0.00
% Business/Commercial 3.47 0.26 4.71 0.00
% Downtown 2.33 0.35 2.42 0.01
% Industrial 0.44 0.27 -3.06 0.00
% Park/Open Space 0.57 0.28 -1.97 0.05
Note:
Rate Ratios > 1 have a positive association with burglary counts.
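As a reading aid for the table above: the negative binomial model is log-linear in its predictors, so each exponentiated coefficient is a rate ratio, the multiplicative change in the expected count for a one-unit increase in that predictor. This is standard NB regression algebra, not specific to this fit:

```latex
E[y_i \mid x_i] = \exp\!\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right),
\qquad
\frac{E[y \mid x_j + 1]}{E[y \mid x_j]} = e^{\beta_j}
```

For example, the rate ratio of 1.08 for Average Building Age implies that each additional year of average building age multiplies the expected burglary count by roughly 1.08, holding the other predictors fixed.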

5. Section Analysis and Overall Discussion

Part VI evaluates the final model against the baseline KDE created earlier in Part II-4, and the results are closer than expected. It was anticipated that the KDE would perform better in-sample, since it overfits more than the final NB model; the NB model, in turn, ended up performing better out-of-sample.

The differences in MAE and RMSE are more prominent in the in-sample table, while the out-of-sample values differ only marginally; in the end, the NB model generalizes better by a slim amount. This can also be observed in the faceted maps, whose spatial patterns overlap very closely.

It is also crucial to address the practical use of this prediction model: it has potential and can be useful under human supervision, and that opinion holds regardless of how well a model performs. Oversight is needed even when the model performs well spatially and temporally, because while it can help direct resources to communities, it may still over- or under-predict; in that light, it can help detect hotspots and serve as an early warning for urban planners considering Chicago’s future form.

It can be argued that this lab’s goal of doing the least damage is contradicted by the simple fact that this is a burglary prediction model, but it can also be argued that the variables themselves relate only to the built environment. The tool can therefore help public servants take a deeper look into alternative urban-planning or local, citizen-driven solutions to alleviate forced-entry burglaries, rather than sending police to communities that are already intensely scrutinized.